{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<p style=\"text-align:center\">\n",
    "<img alt=\"phoenix logo\" src=\"https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg\" width=\"1000\"/>\n",
    "<br>\n",
    "<br>\n",
    "<a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
    "|\n",
    "<a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
    "|\n",
    "<a href=\"https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg?__hstc=259489365.a667dfafcfa0169c8aee4178d115dc81.1733501603539.1733501603539.1733501603539.1&__hssc=259489365.1.1733501603539&__hsfp=3822854628&submissionGuid=381a0676-8f38-437b-96f2-fc10875658df#/shared-invite/email\">Community</a>\n",
    "</p>\n",
    "</center>\n",
    "<h1 align=\"center\">Comparing Prompt Optimization Techniques</h1>\n",
    "\n",
    "This tutorial will use Phoenix to compare the performance of different prompt optimization techniques.\n",
    "\n",
    "You'll start by creating an experiment in Phoenix that can house the results of each of your resulting prompts. Next you'll use a series of prompt optimization techniques to improve the performance of a jailbreak classification task. Each technique will be applied to the same base prompt, and the results will be compared using Phoenix.\n",
    "\n",
    "The techniques you'll use are:\n",
    "- **Few Shot Examples**: Adding a few examples to the prompt to help the model understand the task.\n",
    "- **Meta Prompting**: Prompting a model to generate a better prompt based on previous inputs, outputs, and expected outputs.\n",
    "- **Prompt Gradients**: Using the gradient of the prompt to optimize individual components of the prompt using embeddings.\n",
    "- **DSPy Prompt Tuning**: Using DSPy, an automated prompt tuning library, to optimize the prompt.\n",
    "\n",
    "⚠️ This tutorial requires and OpenAI API key.\n",
    "\n",
    "Let's get started!\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setup Dependencies & Keys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q \"arize-phoenix>=8.0.0\" datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next you need to connect to Phoenix. The code below will connect you to a Phoenix Cloud instance. You can also [connect to a self-hosted Phoenix instance](https://arize.com/docs/phoenix/deployment) if you'd prefer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "os.environ[\"PHOENIX_COLLECTOR_ENDPOINT\"] = \"https://app.phoenix.arize.com\"\n",
    "if not os.environ.get(\"PHOENIX_CLIENT_HEADERS\"):\n",
    "    os.environ[\"PHOENIX_CLIENT_HEADERS\"] = \"api_key=\" + getpass(\"Enter your Phoenix API key: \")\n",
    "\n",
    "if not os.environ.get(\"OPENAI_API_KEY\"):\n",
    "    os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter your OpenAI API key: \")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Dataset into Phoenix\n",
    "\n",
    "Since we'll be running a series of experiments, we'll need a dataset of test cases that we can run each time. This dataset will be used to test the performance of each prompt optimization technique."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "ds = load_dataset(\"jackhhao/jailbreak-classification\")[\"train\"]\n",
    "ds = ds.to_pandas().sample(50)\n",
    "ds.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import uuid\n",
    "\n",
    "import phoenix as px\n",
    "from phoenix.client import Client as PhoenixClient\n",
    "\n",
    "unique_id = uuid.uuid4()\n",
    "\n",
    "# Upload the dataset to Phoenix\n",
    "dataset = px.Client().upload_dataset(\n",
    "    dataframe=ds,\n",
    "    input_keys=[\"prompt\"],\n",
    "    output_keys=[\"type\"],\n",
    "    dataset_name=f\"jailbreak-classification-{unique_id}\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, you can define a base template for the prompt. We'll also save this template to Phoenix, so it can be tracked, versioned, and reused across experiments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "from openai.types.chat.completion_create_params import CompletionCreateParamsBase\n",
    "\n",
    "from phoenix.client.types import PromptVersion\n",
    "\n",
    "params = CompletionCreateParamsBase(\n",
    "    model=\"gpt-3.5-turbo\",\n",
    "    temperature=0,\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"system\",\n",
    "            \"content\": \"You are an evaluator that decides whether a given prompt is a jailbreak risk. Only output 'benign' or 'jailbreak', no other words.\",\n",
    "        },\n",
    "        {\"role\": \"user\", \"content\": \"{{prompt}}\"},\n",
    "    ],\n",
    ")\n",
    "\n",
    "prompt_identifier = \"jailbreak-classification\"\n",
    "\n",
    "prompt = PhoenixClient().prompts.create(\n",
    "    name=prompt_identifier,\n",
    "    prompt_description=\"A prompt for classifying whether a given prompt is a jailbreak risk.\",\n",
    "    version=PromptVersion.from_openai(params),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should now see that prompt in Phoenix:\n",
    "\n",
    "![Prompt in Phoenix](https://storage.googleapis.com/arize-phoenix-assets/assets/images/prompts-screenshot-prompt-optimization-2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next you'll need a task and evaluator for the experiment. A task is a function that will be run across each example in the dataset. The task is also the piece of your code that you'll change between each run of the experiment. To start off, the task is simply a call to GPT 3.5 Turbo with a basic prompt.\n",
    "\n",
    "You'll also need an evaluator that will be used to test the performance of the task. The evaluator will be run across each example in the dataset after the task has been run. Here, because you have ground truth labels, you can use a simple function to check if the output of the task matches the expected output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_prompt(input):\n",
    "    client = OpenAI()\n",
    "    resp = client.chat.completions.create(**prompt.format(variables={\"prompt\": input[\"prompt\"]}))\n",
    "    return resp.choices[0].message.content.strip()\n",
    "\n",
    "\n",
    "def evaluate_response(output, expected):\n",
    "    return output.lower() == expected[\"type\"].lower()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also instrument your code to send all models calls to Phoenix. This isn't necessary for the experiment to run, but it does mean all your experiment task runs will be tracked in Phoenix. The overall experiment score and evaluator runs will be tracked regardless of whether you instrument your code or not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openinference.instrumentation.openai import OpenAIInstrumentor\n",
    "\n",
    "from phoenix.otel import register\n",
    "\n",
    "tracer_provider = register(project_name=\"prompt-optimization\")\n",
    "OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can run the initial experiment. This will be the base prompt that you'll be optimizing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "from phoenix.experiments import run_experiment\n",
    "\n",
    "nest_asyncio.apply()\n",
    "\n",
    "initial_experiment = run_experiment(\n",
    "    dataset,\n",
    "    task=test_prompt,\n",
    "    evaluators=[evaluate_response],\n",
    "    experiment_description=\"Initial base prompt\",\n",
    "    experiment_name=\"initial-prompt\",\n",
    "    experiment_metadata={\"prompt\": \"prompt_id=\" + prompt.id},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should now see the initial experiment results in Phoenix:\n",
    "\n",
    "![1st experiment results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/prompts-nb-experiment.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prompt Optimization Technique #1: Few Shot Examples\n",
    "\n",
    "One common prompt optimization technique is to use few shot examples to guide the model's behavior.\n",
    "\n",
    "Here you can add few shot examples to the prompt to help improve performance. Conviently, the dataset you uploaded in the last step contains a test set that you can use for this purpose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "ds_test = load_dataset(\"jackhhao/jailbreak-classification\")[\n",
    "    \"test\"\n",
    "]  # this time, load in the test set instead of the training set\n",
    "few_shot_examples = ds_test.to_pandas().sample(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define a new prompt that includes the few shot examples. Prompts in Phoenix are automatically versioned, so saving the prompt with the same name will create a new version that can be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "few_shot_template = \"\"\"\n",
    "You are an evaluator that decides whether a given prompt is a jailbreak risk. Only output \"benign\" or \"jailbreak\", no other words.\n",
    "\n",
    "Here are some examples of prompts and responses:\n",
    "\n",
    "{examples}\n",
    "\"\"\"\n",
    "\n",
    "params = CompletionCreateParamsBase(\n",
    "    model=\"gpt-3.5-turbo\",\n",
    "    temperature=0,\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": few_shot_template.format(examples=few_shot_examples)},\n",
    "        {\"role\": \"user\", \"content\": \"{{prompt}}\"},\n",
    "    ],\n",
    ")\n",
    "\n",
    "few_shot_prompt = PhoenixClient().prompts.create(\n",
    "    name=prompt_identifier,\n",
    "    prompt_description=\"Few shot prompt\",\n",
    "    version=PromptVersion.from_openai(params),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You'll notice you now have a new version of the prompt in Phoenix:\n",
    "\n",
    "![Few shot prompt in Phoenix](https://storage.googleapis.com/arize-phoenix-assets/assets/images/prompt-versioning-nb.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define a new task with your new prompt:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_prompt(input):\n",
    "    client = OpenAI()\n",
    "    prompt_vars = {\"prompt\": input[\"prompt\"]}\n",
    "    resp = client.chat.completions.create(**few_shot_prompt.format(variables=prompt_vars))\n",
    "    return resp.choices[0].message.content.strip()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can run another experiment with the new prompt. The dataset of test cases and the evaluator will be the same as the previous experiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "few_shot_experiment = run_experiment(\n",
    "    dataset,\n",
    "    task=test_prompt,\n",
    "    evaluators=[evaluate_response],\n",
    "    experiment_description=\"Prompt Optimization Technique #1: Few Shot Examples\",\n",
    "    experiment_name=\"few-shot-examples\",\n",
    "    experiment_metadata={\"prompt\": \"prompt_id=\" + few_shot_prompt.id},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prompt Optimization Technique #2: Meta Prompting\n",
    "\n",
    "Meta prompting involves prompting a model to generate a better prompt, based on previous inputs, outputs, and expected outputs.\n",
    "\n",
    "The experiment from round 1 serves as a great starting point for this technique, since it has each of those components."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Access the experiment results from the first round as a dataframe\n",
    "ground_truth_df = initial_experiment.as_dataframe()\n",
    "\n",
    "# Sample 10 examples to use as meta prompting examples\n",
    "ground_truth_df = ground_truth_df[:10]\n",
    "\n",
    "# Create a new column with the examples in a single string\n",
    "ground_truth_df[\"example\"] = ground_truth_df.apply(\n",
    "    lambda row: f\"Input: {row['input']}\\nOutput: {row['output']}\\nExpected Output: {row['expected']}\",\n",
    "    axis=1,\n",
    ")\n",
    "ground_truth_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now construct a new prompt that will be used to generate a new prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "meta_prompt = \"\"\"\n",
    "You are an expert prompt engineer. You are given a prompt, and a list of examples.\n",
    "\n",
    "Your job is to generate a new prompt that will improve the performance of the model.\n",
    "\n",
    "Here are the examples:\n",
    "\n",
    "{examples}\n",
    "\n",
    "Here is the original prompt:\n",
    "\n",
    "{prompt}\n",
    "\n",
    "Here is the new prompt:\n",
    "\"\"\"\n",
    "\n",
    "original_base_prompt = (\n",
    "    prompt.format(variables={\"prompt\": \"example prompt\"}).get(\"messages\")[0].get(\"content\")\n",
    ")\n",
    "\n",
    "client = OpenAI()\n",
    "response = client.chat.completions.create(\n",
    "    model=\"gpt-3.5-turbo\",\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": meta_prompt.format(\n",
    "                prompt=original_base_prompt, examples=ground_truth_df[\"example\"].to_string()\n",
    "            ),\n",
    "        }\n",
    "    ],\n",
    ")\n",
    "new_prompt = response.choices[0].message.content.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_prompt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now save that as a prompt in Phoenix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if r\"\\{examples\\}\" in new_prompt:\n",
    "    new_prompt = new_prompt.format(examples=few_shot_examples)\n",
    "\n",
    "params = CompletionCreateParamsBase(\n",
    "    model=\"gpt-3.5-turbo\",\n",
    "    temperature=0,\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": new_prompt},\n",
    "        {\"role\": \"user\", \"content\": \"{{prompt}}\"},\n",
    "    ],\n",
    ")\n",
    "\n",
    "meta_prompt_result = PhoenixClient().prompts.create(\n",
    "    name=prompt_identifier,\n",
    "    prompt_description=\"Meta prompt result\",\n",
    "    version=PromptVersion.from_openai(params),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run this new prompt through the same experiment\n",
    "Redefine the task, using the new prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_prompt(input):\n",
    "    client = OpenAI()\n",
    "    resp = client.chat.completions.create(\n",
    "        **meta_prompt_result.format(variables={\"prompt\": input[\"prompt\"]})\n",
    "    )\n",
    "    return resp.choices[0].message.content.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "meta_prompting_experiment = run_experiment(\n",
    "    dataset,\n",
    "    task=test_prompt,\n",
    "    evaluators=[evaluate_response],\n",
    "    experiment_description=\"Prompt Optimization Technique #2: Meta Prompting\",\n",
    "    experiment_name=\"meta-prompting\",\n",
    "    experiment_metadata={\"prompt\": \"prompt_id=\" + meta_prompt_result.id},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prompt Optimization Technique #3: Prompt Gradient Optimization\n",
    "\n",
    "Prompt gradient optimization is a technique that uses the gradient of the prompt to optimize individual components of the prompt using embeddings. It involves:\n",
    "1. Converting the prompt into an embedding.\n",
    "2. Comparing the outputs of successful and failed prompts to find the gradient direction.\n",
    "3. Moving in the gradient direction to optimize the prompt.\n",
    "\n",
    "Here you'll define a function to get embeddings for prompts, and then use that function to calculate the gradient direction between successful and failed prompts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "# First we'll define a function to get embeddings for prompts\n",
    "def get_embedding(text):\n",
    "    client = OpenAI()\n",
    "    response = client.embeddings.create(model=\"text-embedding-ada-002\", input=text)\n",
    "    return response.data[0].embedding\n",
    "\n",
    "\n",
    "# Function to calculate gradient direction between successful and failed prompts\n",
    "def calculate_prompt_gradient(successful_prompts, failed_prompts):\n",
    "    # Get embeddings for successful and failed prompts\n",
    "    successful_embeddings = [get_embedding(p) for p in successful_prompts]\n",
    "    failed_embeddings = [get_embedding(p) for p in failed_prompts]\n",
    "\n",
    "    # Calculate average embeddings\n",
    "    avg_successful = np.mean(successful_embeddings, axis=0)\n",
    "    avg_failed = np.mean(failed_embeddings, axis=0)\n",
    "\n",
    "    # Calculate gradient direction\n",
    "    gradient = avg_successful - avg_failed\n",
    "    return gradient / np.linalg.norm(gradient)\n",
    "\n",
    "\n",
    "# Get successful and failed examples from our dataset\n",
    "successful_examples = (\n",
    "    ground_truth_df[ground_truth_df[\"output\"] == ground_truth_df[\"expected\"].get(\"type\")][\"input\"]\n",
    "    .apply(lambda x: x[\"prompt\"])\n",
    "    .tolist()\n",
    ")\n",
    "failed_examples = (\n",
    "    ground_truth_df[ground_truth_df[\"output\"] != ground_truth_df[\"expected\"].get(\"type\")][\"input\"]\n",
    "    .apply(lambda x: x[\"prompt\"])\n",
    "    .tolist()\n",
    ")\n",
    "\n",
    "# Calculate the gradient direction\n",
    "gradient = calculate_prompt_gradient(successful_examples[:5], failed_examples[:5])\n",
    "\n",
    "\n",
    "# Function to optimize a prompt using the gradient\n",
    "def optimize_prompt(base_prompt, gradient, step_size=0.1):\n",
    "    # Get base embedding\n",
    "    base_embedding = get_embedding(base_prompt)\n",
    "\n",
    "    # Move in gradient direction\n",
    "    optimized_embedding = base_embedding + step_size * gradient\n",
    "\n",
    "    # Use GPT to convert the optimized embedding back to text\n",
    "    client = OpenAI()\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"gpt-3.5-turbo\",\n",
    "        messages=[\n",
    "            {\n",
    "                \"role\": \"system\",\n",
    "                \"content\": \"You are helping to optimize prompts. Given the original prompt and its embedding, generate a new version that maintains the core meaning but moves in the direction of the optimized embedding.\",\n",
    "            },\n",
    "            {\n",
    "                \"role\": \"user\",\n",
    "                \"content\": f\"Original prompt: {base_prompt}\\nOptimized embedding direction: {optimized_embedding[:10]}...\\nPlease generate an improved version that moves in this embedding direction.\",\n",
    "            },\n",
    "        ],\n",
    "    )\n",
    "    return response.choices[0].message.content.strip()\n",
    "\n",
    "\n",
    "# Test the gradient-based optimization\n",
    "gradient_prompt = optimize_prompt(original_base_prompt, gradient)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gradient_prompt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if r\"\\{examples\\}\" in gradient_prompt:\n",
    "    gradient_prompt = gradient_prompt.format(examples=few_shot_examples)\n",
    "\n",
    "params = CompletionCreateParamsBase(\n",
    "    model=\"gpt-3.5-turbo\",\n",
    "    temperature=0,\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"system\",\n",
    "            \"content\": gradient_prompt,\n",
    "        },  # if your meta prompt includes few shot examples, make sure to include them here\n",
    "        {\"role\": \"user\", \"content\": \"{{prompt}}\"},\n",
    "    ],\n",
    ")\n",
    "\n",
    "gradient_prompt = PhoenixClient().prompts.create(\n",
    "    name=prompt_identifier,\n",
    "    prompt_description=\"Gradient prompt result\",\n",
    "    version=PromptVersion.from_openai(params),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run experiment with gradient-optimized prompt\n",
    "Redefine the task, using the new prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def test_gradient_prompt(input):\n",
    "    client = OpenAI()\n",
    "    resp = client.chat.completions.create(\n",
    "        **gradient_prompt.format(variables={\"prompt\": input[\"prompt\"]})\n",
    "    )\n",
    "    return resp.choices[0].message.content.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gradient_experiment = run_experiment(\n",
    "    dataset,\n",
    "    task=test_gradient_prompt,\n",
    "    evaluators=[evaluate_response],\n",
    "    experiment_description=\"Prompt Optimization Technique #3: Prompt Gradients\",\n",
    "    experiment_name=\"gradient-optimization\",\n",
    "    experiment_metadata={\"prompt\": \"prompt_id=\" + gradient_prompt.id},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prompt Optimization Technique #4: Prompt Tuning with DSPy\n",
    "\n",
    "Finally, you can use an optimization library to optimize the prompt, like DSPy. [DSPy](https://github.com/stanfordnlp/dspy) supports each of the techniques you've used so far, and more."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q dspy openinference-instrumentation-dspy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DSPy makes a series of calls to optimize the prompt. It can be useful to see these calls in action. To do this, you can instrument the DSPy library using the OpenInference SDK, which will send all calls to Phoenix. This is optional, but it can be useful to have."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from openinference.instrumentation.dspy import DSPyInstrumentor\n",
    "\n",
    "DSPyInstrumentor().instrument(tracer_provider=tracer_provider)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you'll setup the DSPy language model and define a prompt classification task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import DSPy and set up the language model\n",
    "import dspy\n",
    "\n",
    "# Configure DSPy to use OpenAI\n",
    "turbo = dspy.LM(model=\"gpt-3.5-turbo\")\n",
    "dspy.settings.configure(lm=turbo)\n",
    "\n",
    "\n",
    "# Define the prompt classification task\n",
    "class PromptClassifier(dspy.Signature):\n",
    "    \"\"\"Classify if a prompt is benign or jailbreak.\"\"\"\n",
    "\n",
    "    prompt = dspy.InputField()\n",
    "    label = dspy.OutputField(desc=\"either 'benign' or 'jailbreak'\")\n",
    "\n",
    "\n",
    "# Create the basic classifier\n",
    "classifier = dspy.Predict(PromptClassifier)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your classifier can now be used to make predictions as you would a normal LLM. It will expect a `prompt` input and will output a `label` prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "classifier(prompt=ds.iloc[0].prompt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, DSPy really shines when it comes to optimizing prompts. By defining a metric to measure successful runs, along with a training set of examples, you can use one of many different optimizers built into the library.\n",
    "\n",
    "In this case, you'll use the `MIPROv2` optimizer to find the best prompt for your task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def validate_classification(example, prediction, trace=None):\n",
    "    return example[\"label\"] == prediction[\"label\"]\n",
    "\n",
    "\n",
    "# Prepare training data from previous examples\n",
    "train_data = []\n",
    "for _, row in ground_truth_df.iterrows():\n",
    "    example = dspy.Example(\n",
    "        prompt=row[\"input\"][\"prompt\"], label=row[\"expected\"][\"type\"]\n",
    "    ).with_inputs(\"prompt\")\n",
    "    train_data.append(example)\n",
    "\n",
    "tp = dspy.MIPROv2(metric=validate_classification, auto=\"light\")\n",
    "optimized_classifier = tp.compile(classifier, trainset=train_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DSPy takes care of our prompts in this case, however you could still save the resulting prompt value in Phoenix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "params = CompletionCreateParamsBase(\n",
    "    model=\"gpt-3.5-turbo\",\n",
    "    temperature=0,\n",
    "    messages=[\n",
    "        {\n",
    "            \"role\": \"system\",\n",
    "            \"content\": optimized_classifier.signature.instructions,\n",
    "        },  # if your meta prompt includes few shot examples, make sure to include them here\n",
    "        {\"role\": \"user\", \"content\": \"{{prompt}}\"},\n",
    "    ],\n",
    ")\n",
    "\n",
    "dspy_prompt = PhoenixClient().prompts.create(\n",
    "    name=prompt_identifier,\n",
    "    prompt_description=\"DSPy prompt result\",\n",
    "    version=PromptVersion.from_openai(params),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run experiment with DSPy-optimized classifier\n",
    "Redefine the task, using the new prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create evaluation function using optimized classifier\n",
    "def test_dspy_prompt(input):\n",
    "    result = optimized_classifier(prompt=input[\"prompt\"])\n",
    "    return result.label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run experiment with DSPy-optimized classifier\n",
    "dspy_experiment = run_experiment(\n",
    "    dataset,\n",
    "    task=test_dspy_prompt,\n",
    "    evaluators=[evaluate_response],\n",
    "    experiment_description=\"Prompt Optimization Technique #4: DSPy Prompt Tuning\",\n",
    "    experiment_name=\"dspy-optimization\",\n",
    "    experiment_metadata={\"prompt\": \"prompt_id=\" + dspy_prompt.id},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prompt Optimization Technique #5: DSPy with GPT-4o\n",
    "\n",
    "In the last example, you used GPT-3.5 Turbo to both run your pipeline, and optimize the prompt. However, you can also use a different model to optimize the prompt, and a different model to run your pipeline.\n",
    "\n",
    "It can be useful to use a more powerful model for your optimization step, and a cheaper or faster model for your pipeline.\n",
    "\n",
    "Here you'll use GPT-4o to optimize the prompt, and keep GPT-3.5 Turbo as your pipeline model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt_gen_lm = dspy.LM(\"gpt-4o\")\n",
    "tp = dspy.MIPROv2(\n",
    "    metric=validate_classification, auto=\"light\", prompt_model=prompt_gen_lm, task_model=turbo\n",
    ")\n",
    "optimized_classifier_using_gpt_4o = tp.compile(classifier, trainset=train_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run experiment with DSPy-optimized classifier using GPT-4o\n",
    "Redefine the task, using the new prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create evaluation function using optimized classifier\n",
    "def test_dspy_prompt(input):\n",
    "    result = optimized_classifier_using_gpt_4o(prompt=input[\"prompt\"])\n",
    "    return result.label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run experiment with DSPy-optimized classifier\n",
    "dspy_experiment_using_gpt_4o = run_experiment(\n",
    "    dataset,\n",
    "    task=test_dspy_prompt,\n",
    "    evaluators=[evaluate_response],\n",
    "    experiment_description=\"Prompt Optimization Technique #5: DSPy Prompt Tuning with GPT-4o\",\n",
    "    experiment_name=\"dspy-optimization-gpt-4o\",\n",
    "    experiment_metadata={\"prompt\": \"prompt_id=\" + dspy_prompt.id},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# You're done!\n",
    "\n",
    "And just like that, you've run a series of prompt optimization techniques to improve the performance of a jailbreak classification task, and compared the results using Phoenix.\n",
    "\n",
    "You should have a set of experiments that looks like this:\n",
    "\n",
    "![Experiment Results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/prompt-optimization-experiment-screenshot.png)\n",
    "\n",
    "From here, you can check out more [examples on Phoenix](https://arize.com/docs/phoenix/notebooks), and if you haven't already, [please give us a star on GitHub!](https://github.com/Arize-ai/phoenix) ⭐️"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
